Relevance and Overlap Aware Text Collection Selection

Authors

Thomas Hernandez

Subbarao Kambhampati

Abstract

In an environment of distributed text collections, the first step in the information retrieval process is to identify which of all available collections are more relevant to a given query and should thus be accessed to answer the query. Collection selection is difficult due to the varying relevance of sources as well as the overlap between these sources. Previous collection selection methods have considered relevance of the collections but have ignored overlap among collections. They thus make the unrealistic assumption that the collections are all effectively disjoint. In this paper, we describe ROSCO, an approach for collection selection which handles collection relevance as well as overlap. We start by developing methods for estimating the statistics concerning size, relevance, and overlap that are necessary to support collection selection. We then explain how ROSCO selects text collections based upon these statistics. Finally, we demonstrate the effectiveness of ROSCO by comparing it to major text collection selection algorithms (CORI and ReDDE) under a variety of scenarios.
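To illustrate why ignoring overlap can waste accesses, here is a minimal sketch of overlap-aware selection. It is not ROSCO itself, whose estimators and selection procedure the abstract does not spell out. It assumes, hypothetically, that for a given query we already have an estimated set of relevant document IDs per collection, and it greedily picks the collection that contributes the most documents not yet covered. The function name `select_collections` and the sample data are made up for illustration.

```python
from typing import Dict, List, Set


def select_collections(results: Dict[str, Set[str]], k: int) -> List[str]:
    """Greedily pick up to k collections by estimated count of new relevant docs.

    `results` maps a collection name to the (assumed, pre-estimated) set of
    relevant document IDs it would return for the query.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(results)
    while remaining and len(chosen) < k:
        # Residual relevance: relevant documents not already covered.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # every remaining collection is fully redundant
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical estimates: B is almost a copy of A, C is small but disjoint.
estimated = {
    "A": {"d1", "d2", "d3", "d4"},
    "B": {"d1", "d2", "d3"},
    "C": {"d5", "d6"},
}
print(select_collections(estimated, 2))  # ['A', 'C']
```

A ranking by relevance alone would choose A and B here, retrieving only four distinct documents. Accounting for overlap selects A and C and retrieves six.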

Similar resources

Relevance and Overlap in Text Resource Selection

Comparing Offline and Online Statistics Estimation for Text Retrieval from Overlapped Collections

Integrated Clustering and Feature Selection Scheme for Text Documents

Problem statement: Text documents are unstructured databases that contain raw data collections. Clustering techniques are used to group text documents according to their similarity. Approach: Feature selection techniques were used to improve the efficiency and accuracy of the clustering process. Feature selection was done by eliminating the redundant and irrelevant items from th...

Efficient Time-Travel on Versioned Text Collections

The availability of versioned text collections such as the Internet Archive opens up opportunities for time-aware exploration of their contents. In this paper, we propose time-travel retrieval and ranking that extends traditional keyword queries with a temporal context in which the query should be evaluated. More precisely, the query is evaluated over all states of the collection that existed d...

Review on Text Clustering Using Statistical and Semantic Data

The explosive growth of information stored in unstructured texts has created a great demand for new and powerful tools to acquire useful information, such as text mining. Document clustering is one of its powerful methods, by which document retrieval, organization and summarization can be achieved. Text documents are unstructured databases that contain raw data collections. The clustering...

Publication date: 2005
